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ABSTRACT 

Optimal use of computing resources requires extensive cod¬ 
ing, tuning and benchmarking. To boost developer produc¬ 
tivity in these time consuming tasks, we introduce the Ex¬ 
perimental Linear Algebra Performance Studies framework 
(ELAPS), a multi-platform open source environment for fast 
yet powerful performance experimentation with dense lin¬ 
ear algebra kernels, algorithms, and libraries. ELAPS al¬ 
lows users to construct experiments to investigate how per¬ 
formance and efficiency vary depending on factors such as 
caching, algorithmic parameters, problem size, and paral¬ 
lelism. Experiments are designed either through Python 
scripts or a specialized GUI, and run on the whole spec¬ 
trum of architectures, ranging from laptops to clusters, ac¬ 
celerators, and supercomputers. The resulting experiment 
reports provide various metrics and statistics that can be an¬ 
alyzed both numerically and visually. We demonstrate the 
use of ELAPS in four concrete application scenarios and in 
as many computing environments, illustrating its practical 
value in supporting critical performance decisions. 

General Terms 

Performance, Experimentation 

Keywords 

performance experiments, dense linear algebra 

1. INTRODUCTION 

The field of high performance computing is largely concerned 
with the optimal usage of available resources. Since per¬ 
formance depends on the choice of algorithms, parameters, 
libraries and even computing environment, maximizing effi¬ 
ciency is a task that comes at the cost of extensive coding, 
tuning and benchmarking. To facilitate and support such 
time-consuming and repetitive activities within the devel¬ 
opment of dense linear algebra software, we propose a rich 
and flexible environment for rapid performance experimen¬ 
tation. 
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The Experimental Linear Algebra Performance Studies frame¬ 
work (ELAPS) allows users to create experiments for inves¬ 
tigating how performance and efficiency depend on factors 
such as caching, algorithmic parameters, problem size, and 
parallelism. Experiments are designed by combining one or 
more algorithmic constructs commonly encountered in lin¬ 
ear algebra computations, and built either through Python 
scripts or a specialized and intuitive GUI. They then can 
be executed either locally or through batch-job systems, on 
hardware ranging from laptops and accelerators to clusters 
and supercomputers. Finally, the results can be visualized 
and analyzed interactively, in terms of various performance 
metrics and statistics. 

As demonstrated in this paper by means of examples rais¬ 
ing in actual applications, insights gained through ELAPS 
serve as a solid ground to make performance relevant design 
decisions. 

The remainder of this paper is structured as follows: Sec. 2 
introduces the experimental features supported by ELAPS; 
the framework, together with its structure and implemen¬ 
tation are described in Sec. 3. Finally, Sec. 4 demonstrates 
the use of ELAPS as a decision-making aid in a series of 
application examples. 

Related Work 

Performance optimization is a widespread activity, impact¬ 
ing virtually all scientific computing disciplines; out of many 
works, here we mention three examples in the field of lin¬ 
ear algebra that are aligned with the studies enabled by 
ELAPS: the optimization of the algorithmic block size for 
LAPACK’s routines [22], the study of symmetric tridiagonal 
eigensolvers [8], and the construction of algorithms for the 
inversion of symmetric positive definite matrices [3]. 

A popular approach for performance optimization is the 
“anto-tuning”: On the one hand, domain-specific libraries 
such as ATLAS [23] and FFTW [10] perform an automatic 
search (with or without explicit timing) to deliver hardware- 
specific code; on the other hand, general-purpose languages 
and libraries such as Active Harmony [7], Atune-IL [19], 
and Chapel [6] make the exploration of a parameter space 
an integral part of the computing environment. A solution 
that combines automation with machine learning techniques 
to offer on-line selection of algorithms is proposed in [12]. In 
constrast with automated solutions, ELAPS’ objective is to 
enable interactive and insightful experimentation. 


Many application-level tools for profiling and analyzing ex¬ 
isting codes exist (examples include PAPI [5], Tau [20], 
Vampir [15], and Scalasca [11]); collectively, they offer 
support for the whole range of architectures, from single 
computing nodes to large distributed computers. In its cur- 
ret form, ELAPS targets shared-memory platforms, and early 
experimentation. 

2. EXPERIMENTS 

While performance experiments come in all kinds of shapes 
and sizes, many of them can be described by a few common 
features. Within the ELAPS framework, we combine and 
generalize such features to provide a versatile central con¬ 
cept of “experiment”. In this section, we discuss these ba¬ 
sic features guided by deliberately simple examples. More 
complicated examples arising in actual applications are then 
presented in Sec. 4. 

We begin with a most elementary experiment: Measuring 
the performance of the matrix-matrix product kernel dgemm. 



As shown schematically above, this experiment runs on one 
core of an Intel SandyBridge E5-2670 processor, using the 
OpenBLAS library [16], and executes the double precision 
kernel dgemm once on random square matrices of size 1000. 
Although simple, similar experiments are commonly used 
to determine the attainable peak performance of a given 
processor. 

When combined with additional information on the hard¬ 
ware and the kernel’s complexity, the raw timing (in cycles) 
from this experiment leads to a number of metrics, which 
yield more insights into how efficiently the CPU is used. 


metric 

value 

cycles 

272 551028 

time [ms] 

104.8 

Gflops/s 

19.1 

flops / cycle 

7.3 

efficiency [%] 

91.7 


Furthermore, if available, the Performance Application 
Programming Interface (PAPI) [5] allows one to access 
useful hardware counters. 


metric 

counter name 

value 

Level 1 cache misses 

PAPI. 

T.1 TCM 

32 933 961 

Conditional branch 
instructions mispredicted 

PAPI. 

.BR_MSP 

3941 


With ELAPS, all these metrics are readily available and 
easily extensible. 
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Figure 1: Performance statistics from 10 repetitions 
of dgemm 

2.1 Repetitions and Statistics 

Multiple executions of a kernel often result in fluctuating 
timings; the reasons for such differences include library ini¬ 
tialization overhead, cache locality, and system jitter. As 
customarily done, in ELAPS this issue is addressed by re¬ 
peating each experiment several times, and by collecting 
statistics. As an example, let us consider an experiment 
that repeats the kernel execution from Experiment 1 ten 
times on the same input matrices (i.e., the same memory 
locations): 



This second experiment produces ten measurements, from 
which we derive statistics as those presented in Fig. 1. It is 
worth pointing out that whenever multiple repetitions are 
executed and timed, the first one almost inevitably repre¬ 
sents an outlier; for the most part, this phenomenon is con¬ 
nected to the initialization of the kernel library, but it is also 
due to the loading and caching of data and instructions. In 
general, a more accurate representation of the effective per¬ 
formance is obtained by dropping the first measurement of 
the lot. In Fig. 1 one can appreciate how signihcantly the 
first repetition affects the various statistics, and most notice¬ 
ably, the minimum, the average and the standard deviation 
(std). 

In order to avoid the impact of “first-execution” outliers, in 
all following examples and studies we always discard the 
measurement relative to the first repetition. 

2.2 Data Placement: Varying Operands 

In Experiment 2, the matrices A, B, and C were reused 
across repetitions, causing them to stay in cache; this sce¬ 
nario is also known as a “warm data”. Depending on the 
application, the assumption of warm data may or may not 
















































be realistic; to reflect this in our experiments, we allow to 
“vary” the operands (i.e., use different memory locations) 
individually in each repetition. Furthermore, in FLAPS one 
can freely control the relative position of varying operands: 
They can be stacked horizontally or vertically, with or with¬ 
out an arbitrary offset. 


In the following experiment, while A and B are fixed and 
quite small, C varies in each repetition (hence the subscript 
in Crep) and is therefore never cached (“cold data”). 
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Experiment 3: Varying Operands 

#threads = 

1 



repeat 100 

times: 



dgemm: 

C„p 


1- B - 


B G ^ j^2000x2000 


In Fig. 2 we present the results of Experiment 3 and an¬ 
other experiment in which the matrices A, B, and C are all 
fixed. The performance loss due to the enforced out-of-cache 
scenario for C is clearly visible. 


2.3 Sequences of Kernels 

In addition to isolated kernels, FLAPS allows to experiment 
with sequences of calls. Let us use the solution of a linear 
system as an example: The problem 




( 1 ) 


is typically solved by first LU-decomposing A (dgetrf), and 
then by solving two triangular linear systems (dtrsm). The 
process—which is also implemented in LAPACK’s dgesv—is 
replicated in Experiment A} 


^For simplicity, we don’t expose the pivoting vector and omit 
the row interchanging kernel dlaswp that only contributes a 
lower order term to the execution time. 


Figure 3: Breakdown of the timings for the solution 
of a linear system 



For this experiment. Fig. 3 shows both the total execution 
time and the time spent in each individual kernel. It is easy 
to realize that for 200 right-hand sides, the LU decomposi¬ 
tion dgetrf is responsible for more than 60 % of the total 
execution time, while each of the dtrsm’s only contribute 
less than 20 %. 

2.4 Parameter Range 

So far we only considered experiments in which the sizes of 
the kernel operands where fixed. In many practical exper¬ 
iments however one wants to study the performance of a 
routine over a range of parameters. In the following exam¬ 
ple, we use the routine dgesv, which solves a linear system 
directly, to solve problems of size n with 500 right hand 
sides, where n ranges from 50 to 2000 in steps of 50. 



Performance results from Experiment 5 are shown in Fig. 4; 
the plot displays the increase in performance for increasing 










































































Figure 6: Influence of block-size on triangular inver- 
Figure 4: Solution of linear systems: performance. sion 



#threads 


Fignre 5: Scalability of LAPACK’s symmetric dense 
eigensolvers on random matrices 


problem size, as typical for dense linear algebra kernels. 


As one can appreciate from Fig. 5, FLAPS makes it easy to 
set up, execute, and compare the results of multiple experi¬ 
ments with varying degrees of parallelism. 

2.5 Sum- and OPENMP-Range 

In loop-based algorithms, the total execution time is often 
more meaningful than an iteration-by-iteration break-down. 
For this purpose, in addition to the “parameter range” de¬ 
scribed in the previous subsection, FLAPS also provides a 
“sum-range”, which yields the total contribution of the loop. 
For instance, the next experiment models the inversion of a 
lower triangular matrix^ of size 1000 via a blocked algorithm 
that traverses the matrix in steps of a fixed block-size nb. 
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Experiment 7: Triangular Inversion 


#thrGads = 1 
repeat 10 times: 

sum over j = 0 : nb: (.1000-nb) : 




AOO 


dsyrk'': 4U1 := 4lil - i-AWf^- 
dtrti2^: 411, 411,“^ 


400 G 410^3 G G : 


2.4.1 Threads Range 

Scalability studies are extremely common examples of ex¬ 
periments that make use of ranges. In the following experi¬ 
ments, we compute the eigenvalue decomposition of a sym¬ 
metric matrix (of fixed size) using from 1 up to 8 threads, 
and compare LAPACK’s solvers dsyev, dsyevx, dsyevr, and 
dsyevd (see [1] for details on these routines). 



2.5.1 OPFNMP-7?ange 

Fig. 6 reports the performance attained by this algorithm for 
different block-sizes nb; the maximum is observed for nb = 
100. The choice of parameters represents an important step 
to tailor algorithms for a given architecture; for instance, the 
tuning of the block size is common to many of the algorithms 
included in LAPACK [1,22]. Notice that a simpler and finer- 
grained experiment to optimize the block-size is obtained by 
combining the sum-range with a parameter-range for nb. 

Multi-threading can typically be exploited in two different 

^ LAPACK’s dtrtri computes the inverse of a triangular 
matrix using a similar algorithm. 

^dtrmm: Triangular matrix matrix multiplication. 

^dsyrk: Symmetric rank k update. 

®dtrti2: Unblocked triangular inversion. 
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Figure 7 : Performance of threaded dtrsm vs. parallel 
dtrsv’s 

ways, namely, either invoking a multi-threaded library (such 
as OpenBLAS), or through OpenMP. To investigate these 
alternatives, in the Experiments 8 and 9 we implement the 
solution of a triangular linear system with a tall and skinny 
right-hand side 1) with OpenBLAS’s threaded dtrsm ker¬ 
nel, and 2) as a series of parallel dtrsv’s® using ELAPS’s 
OPENMP-range. 




The results in Fig. 7 suggest that the parallel dtrsv’s are 
considerably faster than the threaded dtrsm’s, indicating 
that OpenBLAS is not optimally parallelized for such ex¬ 
tremely skewed matrix sizes. 

3. THE ELAPS FRAMEWORK 

The ELAPS framework is built to support performance ex¬ 
periments combining the features and scenarios described 
in Sec. 2. In this section, we present the structure of the 
framework, focusing on the aspects that make it general and 
intuitive, yet powerful. 

As shown in Fig. 8, ELAPS is structured in three layers. 

• The first, “bottom” layer (Sec. 3.1) is written in C/C-l—I- 
and contains the Sampler, a low-level command line 

®dtrsv: Linear system solve with a single right-hand side. 


Figure 8: Structure of the ELAPS framework 

tool responsible for executing and timing individual 
kernels. The Sampler has to be compiled for each spe¬ 
cific combination of hardware and libraries (the only 
stage in which the user needs to configure the system); 
ELAPS can interface with any number of Samplers. 

• The second, “middle” layer (Sec. 3.2) is the Python 
library elaps, which centers around the class Exper¬ 
iment that implements the previously introduced ex¬ 
periments. An Experiment can be executed on differ¬ 
ent Samplers, both locally or through job submission 
systems. The outcome is a Report, which provides not 
only structured access to the individual measurements, 
but also functionality to analyze different metrics and 
statistics. 

This layer also includes the plot module, which is 
based on the matplotlib library, and is used to easily 
visualize Reports in graphical form.^ 

• The third, “top” layer (Sec. 3.3) adds a graphical user 
interface, written in PyQt4, to both design Experi¬ 
ments in the PlayMat and study Reports and plots 
in the Viewer. 

The design of these three layers is discussed in the next 
subsections. 

3.1 The Sampler 

At the core of the ELAPS framework is a low-level perfor¬ 
mance measurement tool tailored to dense linear algebra op¬ 
erations: the Sampler. This tool, earlier versions of which 
were already utilized in [17] and [18], makes it possible to 
measure the performance of individual kernel executions, im¬ 
plementing this work-flow: 

1. Read from stdin a list of calls, i.e., kernel names with 
corresponding lists of arguments; 

2. execute the specihed calls, thereby measuring their 

performance in terms of CPU cycles, and optionally 
through performance counters provided by the Per¬ 
formance Application Programming Interface 
(PAPI) [5]; _ 

^All the plots in this paper were generated in this manner. 
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3. print the measured performance numbers to the stan¬ 
dard output. 

While reading the list of kernels from the standard input, the 
Sampler accepts several special commands: go executes, 
measures, and reports the results of all the calls previously 
read; {omp and } respectively start and end a list of calls to 
be executed as parallel OpenMP tasks; set_couiiters sets 
the PAPI counters for the next set of executions. 

The Sampler accepts kernels and arguments in a format 
that agrees with the conventions used by standard libraries 
such as BLAS and LAPACK: Each argument is passed by 
reference, and is of type char *, int *, float *, or dou¬ 
ble *. 

In a dense linear algebra kernel, these arguments are of one 
of two types. 

• Scalar arguments point to scalar values, some of which 
may influence the kernel’s behavior and control flow. 
Examples include: flag arguments (e.g., side, transA), 
size arguments (e.g., m, n), scalars (e.g., alpha, beta), 
and leading dimensions (e.g., IdA, IdB). 

Within the Sampler, scalar argument values are stored 
consecutively in an array. 

• Data arguments point to memory regions that hold 
the mathematical objects (such as vectors or matri¬ 
ces) involved in the kernel call. Generally®, the actual 
contents of these arguments do not affect the control 
flow; nonetheless, these arguments may still have a 
signihcant impact on performance, depending on their 
location in the memory hierarchy. 

The Sampler has two mechanisms to treat data argu¬ 
ments: 

— Named variables are designated memory regions 
referenced by a variable names. A set of features 
to allocate (malloc®), compute offsets (xoffset) 
and free (free) such variables give users full con¬ 
trol over where operands are stored in relation to 
each other. 

— Dynamic memory offers a fast way to pass “un¬ 
named” memory regions as data arguments. Within 
one call all such regions are guaranteed to be dis¬ 
joint; across calls, however, the same memory re¬ 
gions may be reused arbitrarily. 

To set up the contents of data arguments, the Sam¬ 
pler provides a set of simple utility-type kernels: miemset 
hlls every entry in a buffer with a single value, sgerand 
fills it with random values (uniform in ]0,1[), and xporand 
generates a random symmetric (or Hermitian) positive 
definite matrix. Furthermore, aareadf ile and awritef ile, 
read matrices from and write them to binary files, re¬ 
spectively. 

®Eigensolvers are a notable exception. 

®x £ {i, s, d, c, z} identifies the data-type. 


3.2 The elaps Package 

The middle layer of the ELAPS framework centers around 
the experimental features introduced in Sec. 2, encoded in 
the Python class Experiment. Instances of this class form 
the starting point for performance experiments; executing 
them using Samplers ultimately leads to Reports, which 
can be analyzed with respect to a variety of metrics and 
statistics. 

3.2.1 Experiments 

Experiment instances are both a static description of exper¬ 
iments, which are easily stored to and loaded from strings 
and files for portability, but also feature functionality to sup¬ 
port their design and handling. 

The kernel configurations at the center of each Experiment 
are its connection to libraries such as BLAS or LAPACK. 
While the interfaces of such libraries aim at being general by 
accommodating multiple functionalities, precisely because 
of their generality they are often unintuitive and hard to 
memorize. To counter this problem, elaps uses optional 
“Signatures” to annotate kernels, thereby providing possible 
value ranges and semantic connections between arguments. 
In the end, these Signatures allow Experiments to expose 
feasible values for arguments (such as trans or uplo) and 
automatically derive connected arguments such as operand 
sizes and leading dimensions, both within a single kernel and 
across multiple kernels. 

The execution of an Experiment is initiated by the submit 
method. This method first generates the sequence of kernel 
calls for the Sampler and a shell script for its execution; it 
then either executes this script is locally or submits it to a 
batch job system. 

3.2.2 Execution on Samplers 

In this section, we describe how the Experiment features are 
translated into commands for the Sampler. 

As input, the Sampler expects a raw list of kernel invoca¬ 
tions. To produce this list, all ranges and repetitions in an 
Experiment are completely unrolled, thereby evaluating any 
symbolic (range-dependent) variable. The OPENMP-range 
is translated directly to the Sampler’s {omp and } com¬ 
mands. Using the parameter-range to vary the number of li¬ 
brary threads requires to interface with said library; to avoid 
library-dependent kernels and Sampler features, we do so 
through environment variables (e.g., OPENBLAS_NUM_THREADS) 
and by starting the sampler separately for each thread count. 

Data arguments in kernels are allocated as named variables 
in the Sampler at the beginning of the input. Arguments 
that vary with repetitions or the sum/OPENMP-range (i.e., 
they point to different locations) are first allocated as a sin¬ 
gle large block and then subdivided by calculating appropri¬ 
ate offsets, resulting in individual variables for each repeti¬ 
tion and range iteration. 

Finally, PAPI counters are also set at the beginning through 
the set_couiiters command. 

3.2.3 Reports 



Each Experiment execution results in a report file that, when 
read into elaps, turns into a Report instance. This object 
serves as a structured representation of the obtained mea¬ 
surements with respect to the underlying Experiement: Raw 
measurements are accessed through the hierarchy “parameter- 
range value —>■ repetition —>■ sum/OPENMP-range value —^ 
kernel” and yield the cycle count or PAPI counter measure¬ 
ments. Separately, a “reduced” view on the results accumu¬ 
lates the sum/OPENMP-range and the kernels according to 
the experiment semantic. 

To turn these structured yet raw measurement results into 
more meaningful quantities, metrics combine them with the 
kernels’ flop counts and information on the their execution 
environment. The easily extensible set of metrics ranges 
from “execution time in seconds” to “Gflops/s” and “effi¬ 
ciency”. 

While a metric converts measurements values one-by-one, 
results from multiple repetitions are combined by statistics, 
such as “minimum”, “maximum”, “median” or “standard de¬ 
viation”. As motivated in Sec. 2, the results from first repe¬ 
titions are optionally discarded to hide overhead effects and 
make statistics more representative of in-application invoca¬ 
tions. 

3.2.4 Plotting 

ELAPS’s plot module generates matplotlib figures from 
the structured data in Reports under consideration of both 
metrics and statistics. Depending on the type of experiment, 
it automatically generates appropriate bar- or line-plots that 
are easily exported to various file formats. 

3.3 The Graphical User Interface 

ELAPS’s Experiment and Report Python classes establish 
a flexible and powerful foundation for performance evalua¬ 
tions at a scripting level. In order to enable performance 
experimentation in an explorative fashion, and to facility 
more intuitive interaction, ELAPS features a graphical user 
interface. As shown in Fig. 8, this interface consists of com¬ 
ponents: While the PlayMat serves as a “playing mat” to 
develop Experiments, the Viewer interactively visualizes 
Reports. 

The PlayMat, shown in Fig. 9, allows interactive access to 
the full functionality of an Experiment. To further guide 
the user, among other things, it visualizes the Experiment’s 
kernels based on their Signatures similar to how they are 
presented in this paper and automatically calculates matrix 
sizes and (if desired) deducible arguments, such as leading 
dimensions. It furthermore provides progress tracking of 
executing Experiments and can load completed Reports di¬ 
rectly in the Viewer. 

The Viewer, shown in Fig. 10, is an interactive means to 
analyze and compare Reports. It provides all metrics ap¬ 
plicable to the selected Reports and its statistical plots are 
easily manipulated and exported. 

4. APPLICATION EXAMPLES 

In this section, we demonstrate the use of the ELAPS frame¬ 
work in several application examples. For this purpose, we 
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Figure 9: The PlayMat through Xll. 



Figure 10: The Viewer on Mac OS X 


deliberately chose a wide range of hardware systems and 
kernel libraries. 


4.1 Algorithm Selection: Tensor Contractions 

Let us consider the tensor contraction (in Einstein notation) 

Cabc ^akPkcb ( 2 ) 

with A G R1250x 750^ ^ ^ j^750x500xn^ (j ^ j^l250xnx500^ 

where n is between 100 and 10 000. Using an explicit index 
notation, C is computed as 

ya,\/b,\/c. C[a,b,c] ■= A[a, fc] • B[k,c,b] (3) 

k 

and can be visualized as follows: 


ffl 




C ■.= <2 

A 

k 

B 




c ^ 

b 

k 




(n = 100) 


/ / 



C 

;— a 

1 

A 


(n = 1000) 



A natural approach to efficiently compute such tensor con¬ 
tractions is to utilize the highly optimized dgemm kernel. For 




























































Contraction (2), there are two ways of casting the computa¬ 
tion as a series of dgemm’s: 





b 


(4) 


VcG 500} : 




An inspection of these algorithms reveals that they both ex¬ 
ecute a dgemm of fixed size on varying data; in particular, the 
number of invocations in Algs. V6 and Vc are, respectively, 
n and 500. By virtue of this observation, Experiments 10 
and 11 only perform 10 repetitions, thus reducing the ex¬ 
perimentation time; while the results will not be meaningful 
estimates for the total execution time, they will expose the 
same computational efficiency (expressed in Gflops/s) as the 
full algorithms. Furthermore, since Alg. V6 operates on ma¬ 
trices of fixed size and independent of n, in Experiment 10 
we also avoid the use of the parameter range. 


PowerPC A2 | ESSL | 


Experiment 10: Tensor algorithm Vb 

#threads = 64 




repeat 10 times: 



dgemm : ( 

Ira, 


A 

6 

^ ^ j^l250x750 

Brrap G Crap G RI^^OXSOO 


PowerPC A2 | ESSL | Experiment 11: Tensor algorithm Vc 

#threads = 64 
for n = 100:10:1000: 
repeat 10 times: 







^4-rep 


A 


Brep 


A g r1250x 750^ g g750X^^ g gl250x 


We perform Experiments 10 and 11 on a 16-core IBM Pow¬ 
erPC A2 node of the IBM BlueGene installation JUQUEEN 
at the JuLiCH Supercomputing Center, linked to IBM’s 
optimized ESSL library. Only the Sampler is executed on 
this compute node, running the lightweight CNK operating 
system; it is accessed by elaps from a Red Hat Enter¬ 
prise Linux 6.6 front-end node through the LoadLeveler 
batch job system. 

Fig. 11 suggests that neither of the two algorithms is optimal 
for all cases: While for a small dimension n, algorithm Vb 
is better, algorithm Vc dominates for large n. Interestingly, 
the crossover point is not at n = 500, where both algorithms 
work with matrices of equal size, but already around n = 
300. 

4.2 Library Selection: Sylvester Equation 


Figure 11: Comparison of dgemm-based algorithms for 
tensor contraction 2 


Choosing the right library can be a crucial step in attain¬ 
ing high performance. In this section, we demonstrate its 
importance by considering the triangular Sylvester equation 



( 6 ) 


to be solved for X, which is central to problems in control 
theory. In addition to LAPACK’s dtrsyl, several other li¬ 
braries offer this kernel with the same interface. In our tests, 
we consider 


• LAPACK 3.5 [1] linked to OpenBLAS 0.2.14 [16], 

• RECSY 0.01 [14] linked to OpenBLAS, 

• libFLAME^® 5.1.0-18 [21] linked to OpenBLAS, and 

• Intel’s commercial MKL 11.0 [13]. 


To compare these libraries, we launch Experiment 12 on a 
10-core Intel IvyBridge E5-2680 v2 processor with Sam¬ 
plers linked to the above libraries. This machine, which 
is part of a compute cluster is accessed through the Plat¬ 
form LSF 9.1.2 batch job system with both the front-end 
(elaps) and back-end (the Samplers) running Scientific 
Linux 6.6. 


E5-2680 v2 | see above 

Experiment 12: 

Sylvester solver 

#threads = 10 

for i = 100:100:6000: 
repeat 10 times: 

dtrsyl: 

C 

:= Sylv ^ A , 


A G R‘X/ B G 

i / 2 ^ 

C G R‘X»/2 



Fig. 12 shows the performance attained by the different li¬ 
braries. LAPACK, which only provides an unblocked algo¬ 
rithm for dtrsyl, reaches 2 Gflops/s for small problems but 

^°While libFLAME’s LAPACK interface does by default not 
call its optimized Sylvester solver, it is easily exposed. 













































































































Figure 12: Comparison of libraries for the triangular 
Sylvester equation 


Figure 13: Multi-threading paradigms for a se¬ 
quence of LUs^® 


eventually falls below 1 Gflops/s. The specialized RECSY li¬ 
brary on the other hand attains the best performance of up 
to 24.5 Gflops/s. libFLAME is initially competitive with 
REGSY but eventually tops at 20.5 Gflops/s. Surprisingly, 
the otherwise very efficient MKL seems poorly optimized for 
this problem and is as fast as LAPACK. 

4.3 Multithreading: Sequence of LUs 

In a certain type of electronic structure calculations [2] , one 
has to solve a series of fairly small linear systems. As already 
mentioned, a possible approach involves the LU factorization 
(dgetrf) and two triangular linear systems (dtrsm) for each 
matrix; since each system only involves two right-hand sides, 
the cost for the dtrsm’s is entirely negligible. 

On multi-core machines, the sequence of LUs can be paral¬ 
lelized at the granularity of one matrix, via a multi-threaded 
dgetrf kernel, or by assigning different matrices to different 
threads, via OpenMP (hybrid solutions are also possible). 
Performance-wise, there are arguments both in favor and 
against each of these two alternatives: Using BLAS’s inter¬ 
nal parallelism ensures that only one kernel uses the CPU’s 
caches at a time; on the other hand, OpenMP ’s parallelism 
increases the amount of work that the CPU’s cores can per¬ 
form simultaneously. 

The next three experiments use ELAPS’s sum-range and 
OPENMP-range constructs to model the scenarios in which 
1) a multi-threaded kernel is used, 2) OpenMP runs sequen¬ 
tial kernels in parallel, and 3) OpenMP runs multi-threaded 
kernels in parallel.Each of them measures the time it takes 
to LU decompose an increasing number of square matrices 
of size 800. 




Fig. 13 indicates that of the two “pure” approaches, if more 
than 8 LU’s are performed (i.e., more than the CPU has 
hardware threads), OpenMP with single-threaded kernels 
outperforms Accelerate’s parallel kernel. However, the 
mixed approach, in which Accelerate uses up to 8 threads, 
while OpenMP is allowed to schedule the LU decomposition 
tasks, is even more efficient, reaching up to 75 Gflops/s. 

4.4 Algorithmic Optimization: GWAS 

Genome Wide Association Studies (GWAS) investigate how 
human traits (e.g. eye color or genetic deceases) are related 
to certain locations in the human genome [4,24]. Compu¬ 
tationally, GWAS can be cast as a sequence of Generalized 
Least Squares (GLS) problems 


These three experiments are executed on a MacBook Pro 
running OS X 10.9.4 with a quad-core Intel Haswell i 7- 
4850HQ CPU (Turbo Boost disabled) using Apple’s Ac¬ 
celerate framework; both the Sampler and elaps run on 
the same platform. 

'^^This third experiment is not displayed: It is obtained from 
Experiment 14, changing #threads to 8. 



(7) 


Even though the Turbo Boost was disabled and no other 
application was running, noticeable performance fluctua¬ 
tions were observed. This is to be expected on laptop- 
systems. 
























































Figure 14: Timing breakdown for a sequence of GLS 
solves 


where M £ is symmetric positive definite, Xi £ R"^^, 

y £ R”, and bi £ R^ with 1000 < n < 5000, p < 20 and 
i £ {1,... m}, where m can be in the millions. 

A straightforward implementation of this equation (e.g., us¬ 
ing R or Matlab) might compute each bi individually by 
solving Equation 7 from right to left, as modeled in this 
next experiment (with n = 1000 and p = 4): 



For Experiments 15 and 16, we choose a 60-core Intel Xeon 
Phi co-processor using Intel MKL. ELAPS’s python library 
and the PlayMat are run on this system’s Host processor 
(Scientific Linux 6.6); only the Sampler is executed na¬ 
tively on the co-processor. 

Fig. 14 shows both the execution time of Experiment 14, as 
well as a breakdown thereof. The runtime is clearly domi¬ 
nated by the dposv and dpotrs kernels involving M. 


'^^dposv: Cholesky decomposition + linear system solve. 
^®dgemv: General matrix vector product. 

'^^dpotrs: Linear system solve following a Cholesky decom¬ 
position. 


From a first analysis of these two kernels, one realizes that 
the dposv (y := M~^y) is independent of i, and can thus be 
taken out of the loop and computed just once. This modifi¬ 
cation reduces dposv’s contribution to the total runtime by 
a factor of m, effectively shifting the bottleneck onto dpotrs. 

A further analysis reveals that all the dpotrs linear systems 
involve the same matrix M; this observation suggests to 
combine the right-hand sides Xi £ for all m iterations 

into a single large matrix X £ 'pjjg following 

experiment solves a linear system with this matrix as the 
right-hand side with a single invocation of dpotrs: 



For the selected range of m, the runtime of this experiment 
is below 200 ms, i.e., already more than 1 order of magnitude 
less than the previous experiment. Furthermore, for larger 
problems, the dpotrs kernel can make good use of the co¬ 
processor’s many cores and reaches over 550Gflops/s. For 
an in depth study of performance optimizations for GWAS 
we refer the reader to [?] and [9]. 

5. CONCLUSION 

We introduced the Experimental Linear Algebra Performance 
Studies framework (ELAPS), a set of tools and features 
to design, execute, measure and analyze dense linear al¬ 
gebra performance experiments. With its intuitive inter¬ 
face, ELAPS assists users in investigating performance be¬ 
haviors and in making informed decisions. Throughout this 
paper, we applied ELAPS to a wide range of scenarios, in¬ 
cluding parameter optimization (Sec. 2.5), algorithm selec¬ 
tion (Sec. 2.4.1, Sec. 4.1), library comparison (Sec. 4.2), and 
parallelism and library threading (Sec. 2.5.1, Sec. 4.3). We 
demonstrated the framework’s flexibility by linking it with 
seven kernel-libraries, and by executing experiments on five 
different platforms, including the Xeon Phi co-processor and 
two different batch-job systems. In summary, ELAPS cov¬ 
ers many aspects of shared-memory optimizations, a critical 
step towards achieving large-scale performance. 

Having established the foundations of a framework for rapid 
experimentation, we foresee many opportunities for exten¬ 
sions. In particular, we envision 1) coverage of a broader 
range of architectures, including GPUs and ARM-based CPUs, 
2) support for metrics related to data movement and energy 
consumption, and 3) interfaces for distributed memory li¬ 
braries such as ScaLAPACK. 

Software 

ELAPS is open source (BSD license) and available on GitHub: 
http://github.com/elmar-peise/ELAPS 
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